2 research outputs found

    MTrainS: Improving DLRM training efficiency using heterogeneous memories

    Recommendation models are very large, requiring terabytes (TB) of memory during training. In pursuit of better quality, model size and complexity grow over time, which requires additional training data to avoid overfitting. This model growth demands substantial resources in data centers, so training efficiency is becoming considerably more important to keep data center power demand manageable. In Deep Learning Recommendation Models (DLRM), sparse features capturing categorical inputs through embedding tables are the major contributors to model size and require high memory bandwidth. In this paper, we study the bandwidth requirements and locality of embedding tables in real-world deployed models. We observe that the bandwidth requirement is not uniform across tables and that embedding tables show high temporal locality. We then design MTrainS, which hierarchically leverages heterogeneous memory, including byte-addressable and block-addressable Storage Class Memory, for DLRM. MTrainS allows higher memory capacity per node and increases training efficiency by lowering the need to scale out to multiple hosts in memory-capacity-bound use cases. By optimizing the platform memory hierarchy, we reduce the number of nodes needed for training by 4-8X, saving training power and cost while meeting our target training performance.
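
    The central mechanism, placing each embedding table into the fastest memory tier that can hold it, ordered by bandwidth demand, can be illustrated with a short sketch. The tier names, capacities, and greedy policy below are assumptions for exposition, not MTrainS's actual placement algorithm.

        # Illustrative greedy placement of embedding tables across
        # memory tiers. Tier sizes and the policy are hypothetical.
        from dataclasses import dataclass

        @dataclass
        class Table:
            name: str
            size_gb: float        # capacity footprint of the table
            bandwidth_gbs: float  # observed read-bandwidth demand

        # Tiers ordered fastest-first: HBM, then DRAM, then SCM.
        TIERS = [("HBM", 64.0), ("DRAM", 512.0), ("SCM", 2048.0)]

        def place_tables(tables):
            """Hot (high-bandwidth) tables claim the fastest tier with
            room; large, cold tables spill down to block-addressable SCM."""
            free = dict(TIERS)
            placement = {}
            for t in sorted(tables, key=lambda t: t.bandwidth_gbs, reverse=True):
                for tier, _ in TIERS:
                    if free[tier] >= t.size_gb:
                        free[tier] -= t.size_gb
                        placement[t.name] = tier
                        break
                else:
                    raise MemoryError(f"no tier can hold {t.name}")
            return placement

        # Example: a huge but cold table lands in SCM, hot tables in HBM.
        tables = [Table("user_id", 900.0, 5.0), Table("item_id", 40.0, 200.0),
                  Table("category", 2.0, 80.0), Table("geo", 300.0, 20.0)]
        print(place_tables(tables))
        # {'item_id': 'HBM', 'category': 'HBM', 'geo': 'DRAM', 'user_id': 'SCM'}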

    Efficient Utilization of Heterogeneous Compute and Memory Systems

    The returns from scaling conventional compute and memory systems for higher performance at lower cost and power have diminished. Concurrently, diverse compute- and memory-demanding workloads continue to grow and stress traditional systems built only from CPUs and DRAM. Heterogeneous compute and memory systems create the opportunity to boost performance for these demanding workloads by providing hardware units with specialized characteristics. Specialized compute platforms such as GPUs, FPGAs, and accelerators execute specific tasks faster than CPUs, increasing performance and energy efficiency for those tasks. Heterogeneity in the memory system, such as incorporating memory technologies like storage class memories (SCMs) alongside DRAM, allows denser, lower-power, and lower-cost memories to accommodate data-intensive applications. However, heterogeneous systems have unique characteristics compared to traditional systems, and we must carefully design how workloads utilize these units to harness their full benefits. This dissertation presents software and hardware techniques that maximize the performance, energy, and cost-efficiency of heterogeneous systems based on the compute and memory access patterns of various application domains. First, this thesis proposes ChipAdvisor, a machine learning-based framework that identifies the best platform for an application in the early stages of system design. ChipAdvisor considers intrinsic application characteristics such as parallelism, locality, and synchronization patterns, and achieves 98% and 94% accuracy in predicting the best-performing and most energy-efficient platform, respectively, for diverse workloads on a system with a CPU, GPU, and FPGA. Second, we propose a heterogeneous memory-enabled system design with DRAM and storage class memory (SCM) for key-value stores, one of the largest workloads in data centers. We characterize an extensive deployment of key-value stores in a commercial data center and design optimal server configurations with heterogeneous memories, achieving an 80% performance increase over a single-socket platform while reducing the total cost of ownership (TCO) by 43-48% compared to a two-socket platform. Third, this dissertation designs MTrainS, an end-to-end recommendation system trainer that utilizes heterogeneous compute and memory systems. MTrainS efficiently divides recommendation model training tasks between CPUs and GPUs based on their compute patterns. It then hierarchically utilizes various memory types, such as HBM, DRAM, and SCMs, by studying the temporal locality and bandwidth requirements of recommendation system models in data centers. MTrainS reduces the number of hosts used for training by up to 8×, decreasing the power and cost of training. Lastly, this dissertation proposes CoACT, which designs fine-grain cache and memory sharing for collaborative workloads running in integrated CPU-GPU systems. CoACT uses the collaborative pattern of applications to fine-tune cache partitioning and interconnect and memory controller utilization for the CPU and GPU, improving performance by 25%.

    PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/175614/1/hiwot_1.pd
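
    As a rough illustration of the ChipAdvisor idea, a classifier can map a workload's parallelism, locality, and synchronization profile to a target platform. The feature encoding, training data, and random-forest model below are illustrative assumptions; the abstract does not specify these details.

        # Minimal sketch of a ChipAdvisor-style platform predictor.
        # Features, labels, and model choice are hypothetical examples.
        from sklearn.ensemble import RandomForestClassifier

        # Each row: [parallelism, data locality, synchronization intensity],
        # all normalized to [0, 1]. Labels name the best-performing platform.
        X = [
            [0.9, 0.8, 0.1],  # massively parallel, regular access  -> GPU
            [0.2, 0.9, 0.7],  # serial, sync-heavy control flow     -> CPU
            [0.7, 0.5, 0.3],  # pipelined, custom-datapath friendly -> FPGA
            [0.8, 0.9, 0.2],
            [0.1, 0.6, 0.8],
            [0.6, 0.4, 0.4],
        ]
        y = ["GPU", "CPU", "FPGA", "GPU", "CPU", "FPGA"]

        model = RandomForestClassifier(n_estimators=50, random_state=0)
        model.fit(X, y)

        # Predict the target platform for a new workload profile.
        print(model.predict([[0.85, 0.7, 0.15]]))  # likely "GPU"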